Chapter 9 GWR and spatially lagged regression
9.1 Recommended listening
Some of these practicals are long, take regular breaks and have a listen to some of our fav tunes each week.
AdamThis week it’s Radio1’s DnB 60, recorded live at the Warehouse Project. Die, Bryan G, Jumpin’ Jack Frost and it sounds like Dynamite on the mic — some proper rollers in here!
9.2 Introduction
In this practical you will be introduced to a suite of different models that will allow you to test a variety of research questions and hypotheses through modelling the associations between two or more spatially reference variables.
In the worked example, we will explore the factors that might affect the average exam scores of 16 year-old across London. GSCEs are the exams taken at the end of secondary education and here have been aggregated for all pupils at their home addresses across the City for Ward geographies.
The London Data Store collates a range of other variables for each Ward and so we will see if any of these are able to help explain the patterns of exam performance that we see.
This practical will walk you through the common steps that you should go through when building a regression model using spatial data to test a stated research hypothsis; from carrying out some descriptive visualisation and summary statistics, to interpreting the results and using the outputs of the model to inform your next steps.
If you want to do some more reading around regression, then Chapter 5 of the Modern Dive book is pretty great.
9.2.1 Setting up your Data
First, let’s set up R and read in some data to enable us to carry out our analysis.
#library a bunch of packages we may (or may not) use - install them first if not installed already.
library(tidyverse)
library(tmap)
library(geojsonio)
library(plotly)
library(rgdal)
library(broom)## Warning: package 'broom' was built under R version 3.6.2
## Warning: package 'car' was built under R version 3.6.2
## [1] "C:/Users/ucfnmac/OneDrive - University College London/Teaching/CASA0005repo"
#read some ward data in
#download a zip file containing some boundaries we want to use
download.file("https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip", destfile="prac9_data/statistical-gis-boundaries-london.zip")
#unzip it
unzip("prac9_data/statistical-gis-boundaries-london.zip", exdir="prac9_data")
#look at the subfolders in the directory to work out where we want to get our shapefile from
list.dirs("prac9_data/statistical-gis-boundaries-london")## [1] "prac9_data/statistical-gis-boundaries-london"
## [2] "prac9_data/statistical-gis-boundaries-london/ESRI"
## [3] "prac9_data/statistical-gis-boundaries-london/MapInfo"
#read in the boundaries from the file you have just unzipped into your working directory
LondonWardsss <- readOGR("prac9_data/statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.shp", layer="London_Ward_CityMerged")## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\ucfnmac\OneDrive - University College London\Teaching\CASA0005repo\prac9_data\statistical-gis-boundaries-london\ESRI\London_Ward_CityMerged.shp", layer: "London_Ward_CityMerged"
## with 625 features
## It has 7 fields
## Integer64 fields read as strings: POLY_ID
#convert it to a simple features object
LondonWardsssSF <- st_as_sf(LondonWardsss)
#check coordinate reference system
LondonWardsssSF## Simple feature collection with 625 features and 7 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601272 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.1502,0.247,0.8421,-20.4894 +units=m +no_defs
## First 10 features:
## NAME GSS_CODE HECTARES NONLD_AREA LB_GSS_CD
## 0 Chessington South E05000405 755.173 0 E09000021
## 1 Tolworth and Hook Rise E05000414 259.464 0 E09000021
## 2 Berrylands E05000401 145.390 0 E09000021
## 3 Alexandra E05000400 268.506 0 E09000021
## 4 Beverley E05000402 187.821 0 E09000021
## 5 Coombe Hill E05000406 442.170 0 E09000021
## 6 Chessington North and Hook E05000404 192.980 0 E09000021
## 7 Surbiton Hill E05000413 166.482 0 E09000021
## 8 Old Malden E05000410 180.016 0 E09000021
## 9 St. Mark's E05000412 137.578 0 E09000021
## BOROUGH POLY_ID geometry
## 0 Kingston upon Thames 50840 POLYGON ((516401.6 160201.8...
## 1 Kingston upon Thames 117160 POLYGON ((517829.6 165447.1...
## 2 Kingston upon Thames 50449 POLYGON ((518107.5 167303.4...
## 3 Kingston upon Thames 50456 POLYGON ((520480 166909.8, ...
## 4 Kingston upon Thames 117161 POLYGON ((522071 168144.9, ...
## 5 Kingston upon Thames 117159 POLYGON ((522007.6 169297.3...
## 6 Kingston upon Thames 50530 POLYGON ((517175.3 164077.3...
## 7 Kingston upon Thames 50457 POLYGON ((517469.3 166878.5...
## 8 Kingston upon Thames 50455 POLYGON ((522231.1 166015, ...
## 9 Kingston upon Thames 50450 POLYGON ((517460.6 167802.9...
#Proj4 string tells me it's in wgs84, so Convert to British National Grid
BNG = "+init=epsg:27700"
LondonWardsssSFBNG <- st_transform(LondonWardsssSF, BNG)
#check the data
qtm(LondonWardsssSFBNG)
Now we are going to read in some data from the London Data Store - https://data.london.gov.uk/
#read in some attribute data
LondonWardProfiles <- read_csv("https://data.london.gov.uk/download/ward-profiles-and-atlas/772d2d64-e8c6-46cb-86f9-e52b4c7851bc/ward-profiles-excel-version.csv", col_names = TRUE, locale = locale(encoding = 'Latin1'))9.2.1.1 Cleaning the data as you read it in
Examining the dataset as it is read in above, you can see that a number of fields in the dataset that should have been read in as numeric data, have actually been read in as character (text) data.
If you examine your data file, you will see why. In a number of columns where data are missing, rather than a blank cell, the values ‘n/a’ have been entered in instead. Where these text values appear amongst numbers, the software will automatically assume the whole column is text.
To deal with these errors, we can force read_csv to ignore these values by telling it what values to look out for that indicate missing data
#We can use readr to deal with the issues in this dataset - which are to do with text values being stored in columns containing numeric values
#read in some data - couple of things here. Read in specifying a load of likely 'n/a' values, also specify Latin1 as encoding as there is a pound sign (£) in one of the column headers - just to make things fun!
LondonWardProfiles <- read_csv("https://data.london.gov.uk/download/ward-profiles-and-atlas/772d2d64-e8c6-46cb-86f9-e52b4c7851bc/ward-profiles-excel-version.csv", na = c("", "NA", "n/a"), locale = locale(encoding = 'Latin1'), col_names = TRUE)## Parsed with column specification:
## cols(
## .default = col_double(),
## `Ward name` = col_character(),
## `Old code` = col_character(),
## `New code` = col_character()
## )
## See spec(...) for full column specifications.
Now you have read in both your boundary data and your attribute data, you need to merge the two together using a common ID. In this case, we can use the ward codes to achieve the join
#merge boundaries and data
LonWardProfiles <- left_join(LondonWardsssSFBNG, LondonWardProfiles, by = c("GSS_CODE" = "New code"))## Warning: Column `GSS_CODE`/`New code` joining factor and character vector,
## coercing into character vector
## tmap mode set to interactive viewing
9.2.1.2 Additional Data
In addition to our main datasets, it might also be useful to add some contextual data. While our exam results have been recorded at the home address of students, most students would have attended one of the schools in the City.
Let’s add some schools data as well.
#might be a good idea to see where the secondary schools are in London too
london_schools <- read_csv("https://data.london.gov.uk/download/london-schools-atlas/57046151-39a0-45d9-8dc0-27ea7fd02de8/all_schools_xy_2016.csv")## Parsed with column specification:
## cols(
## .default = col_character(),
## OBJECTID = col_number(),
## URN = col_double(),
## EASTING = col_double(),
## NORTHING = col_double(),
## NEW_URN = col_logical(),
## OLD_URN = col_logical(),
## map_icon_l = col_double(),
## Primary = col_double(),
## x = col_double(),
## y = col_double()
## )
## See spec(...) for full column specifications.
#from the coordinate values stored in the x and y columns, which look like they are latitude and longitude values, create a new points dataset
lon_schools_sf <- st_as_sf(london_schools, coords = c("x","y"), crs = 4326)
#now pull out the secondary schools
#these are the same - one uses grep() and one uses the stringr() package
lond_sec_schools_sf <- lon_schools_sf[str_which(lon_schools_sf[["PHASE"]],"Secondary"),]
lond_sec_schools_sf <- lon_schools_sf[grep("Secondary",lon_schools_sf[["PHASE"]]),]
tmap_mode("view")## tmap mode set to interactive viewing
9.3 Analysing GCSE exam performance - testing a research hypothesis
To explore the factors that might influence GCSE exam performance in London, we are going to run a series of different regression models. A regression model is simply the expression of a linear relationship between our outcome variable (Average GCSE score in each Ward in London) and another variable or several variables that might explain this outcome.
9.3.1 Research Question and Hypothesis
Examining the spatial distribution of GSCE point scores in the map above, it is clear that there is variation across the city. My research question is:
What are the factors that might lead to variation in Average GCSE point scores across the city?
My research hypothesis that I am going to test is that there are other observable factors occurring in Wards in London that might affect the average GCSE scores of students living in those areas.
In inferential statistics, we cannot definitively prove a hypothesis is true, but we can seek to disprove that there is absolutely nothing of interest occurring or no association between variables. The null hypothesis that I am going to test empirically with some models is that there is no relationship between exam scores and other observed variables across London.
9.3.2 Regression Basics
For those of you who know a bit about regression, you might want to skip down to the next section. However, if you are new to regression or would like a refresher, read on…
The linear relationship in a regression model is probably most easily explained using a scatter plot:
q <- qplot(x = `Unauthorised Absence in All Schools (%) - 2013`, y = `Average GCSE capped point scores - 2014`, data=LonWardProfiles)
#plot with a regression line - note, I've added some jitter here as the x-scale is rounded
q + stat_smooth(method="lm", se=FALSE, size=1) + geom_jitter()
Here, I have plotted the average GCSE point score for each Ward in London against another variable in the dataset that I think might be influential: the % of school days lost to unauthorised absences in each ward.
Remember that my null hypothesis would be that there is no relationship between GCSE scores and unauthorised absence from school. If this null hypothesis was true, then I would not expect to see any pattern in the cloud of points plotted above.
As it is, the scatter plot shows that, generally, as the \(x\) axis independent variable (unauthorised absence) goes up, our \(y\) axis dependent variable (GCSE point score) goes down. This is not a random cloud of points, but something that indicates there could be a relationshp here and so I might be looking to reject my null hypothesis.
Some conventions - In a regression equation, the dependent variable is always labelled \(y\) and shown on the \(y\) axis of a graph, the predictor or independent variable(s) is(are) always shown on the \(x\) axis.
I have added a blue line of best-fit - this is the line that can be drawn by minimising the sum of the squared differences between the line and the residuals. The residuals are all of the dots not falling exactly on the blue line. An algorithm known as ‘ordinary least squares’ (OLS) is used to draw this line and it simply tries a selection of different lines until the sum of the squared divations between all of the residuals and the blue line is minimised, leaving the final solution.
As a general rule, the better the blue line is at summarising the relationship between \(y\) and \(x\), the better the model.
The equation for the blue line in the graph above can be written:
(1)\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\]
where:
\(\beta_0\) is the intercept (the value of \(y\) when \(x = 0\) - somewhere around 370 on the graph above);
\(\beta_1\) is sometimes referred to as the ‘slope’ parameter and is simply the change in the value of \(y\) for a 1 unit change in the value of \(x\) (the slope of the blue line) - reading the graph above, the change in the value of \(y\) reading between 1.0 and 2.0 on the \(x\) axis looks to be around -40.
\(\epsilon_i\) is a random error term (positive or negative) that should sum to 0 - esentially, if you add all of the vertical differences between the blue line and all of the residuals, it should sum to 0.
Any value of \(y\) along the blue line can be modelled using the corresponding value of \(x\) and these parameter values. Examining the graph above we would expect the average GCSE point score for a student living in a Ward where 0.5% of school days per year were missed, to equal around 350, but we can confirm this by plugging the \(\beta\) parameter values and the value of \(x\) into equation (1):
## [1] 350
9.3.3 Running a Regression Model in R
In the graph above, I used a method called ‘lm’ in the stat_smooth function in ggplot2 to draw the regression line. ‘lm’ stands for ‘linear model’ and is a standard function in R for running linear regression models. Use the help system to find out more about lm - ?lm
Below is the code that could be used to draw the blue line in our scatter plot. Note, the tilde ~ symbol means “is modelled by”
#run the linear regression model and store its outputs in an object called model1
model1 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013`, data = LonWardProfiles)
#show the summary of those outputs
summary(model1)##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013`,
## data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.753 -10.223 -1.063 8.547 61.842
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 371.471 2.165 171.6
## `Unauthorised Absence in All Schools (%) - 2013` -41.237 1.927 -21.4
## Pr(>|t|)
## (Intercept) <2e-16 ***
## `Unauthorised Absence in All Schools (%) - 2013` <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.39 on 624 degrees of freedom
## Multiple R-squared: 0.4233, Adjusted R-squared: 0.4224
## F-statistic: 458 on 1 and 624 DF, p-value: < 2.2e-16
9.3.3.1 Interpreting and using the model outputs
In running a regression model, we are effectively trying to test (disprove) our null hypothesis. If our null hypothsis was true, then we would expect our coefficients to = 0.
In the output summary of the model above, there are a number of features you should pay attention to:
Coefficient Estimates - these are the \(\beta_0\) (intercept) and \(\beta_1\) (slope) parameter estimates from Equation 1. You will notice that at \(\beta_0 = 371.471\) and \(\beta_1 = -41.237\) they are pretty close to the estimates of 370 and -40 that we read from the graph earlier, but more precise.
Coefficient Standard Errors - these represent the average amount the coefficient varies from the average value of the dependent variable (its standard deviation). So, for a 1% increase in unauthorised absence from school, while the model says we might expect GSCE scores to drop by -41.2 points, this might vary, on average, by about 1.9 points. As a rule of thumb, we are looking for a lower value in the standard error relative to the size of the coefficient.
Coefficient t-value - this is the value of the coefficient divided by the standard error and so can be thought of as a kind of standardised coefficient value. The larger (either positive or negative) the value the greater the relative effect that particular independent variable is having on the dependent variable (this is perhaps more useful when we have several independent variables in the model) .
Coefficient p-value - Pr(>|t|) - the p-value is a measure of significance. There is lots of debate about p-values which I won’t go into here, but essentially it refers to the probability of getting a coefficient as large as the one observed in a set of random data. p-values can be thought of as percentages, so if we have a p-value of 0.5, then there is a 5% chance that our coefficient could have occurred in some random data, or put another way, a 95% chance that out coefficient could have only occurred in our data.
As a rule of thumb, the smaller the p-value, the more significant that variable is in the story and the smaller the chance that the relationship being observed is just random. Generally, statisticians use 5% or 0.05 as the acceptable cut-off for statistical significance - anything greater than that we should be a little sceptical about.
In r the codes ***, **, **, . are used to indicate significance. We generally want at least a single * next to our coefficient for it to be worth considering.
R-Squared - This can be thought of as an indication of how good your model is - a measure of ‘goodness-of-fit’ (of which there are a number of others). \(r^2\) is quite an intuitite measure of fit as it ranges between 0 and 1 and can be thought of as the % of variation in the dependent variable (in our case GCSE score) explained by variation in the independent variable(s). In our example, an \(r^2\) value of 0.42 indicates that around 42% of the variation in GCSE scores can be explained by variation in unathorised absence from school. In other words, this is quite a good model. The \(r^2\) value will increase as more independent explanatory variables are added into the model, so where this might be an issue, the adjusted r-squared value can be used to account for this affect
9.3.3.2 Further information
There are other features of the model outputs (residual degrees of freedom, standard error etc.) that I won’t go into here - for more detail about interpreting the model outputs, try this resource:
http://blog.yhat.com/posts/r-lm-summary.html
In addtion, the two papers by Dunn (1989) and Jones (1984) on Moodle are old, but very good!
9.3.3.3 Saving your model outputs
The standard model output in R is not particularly easy to export for publication, however you can use the tidy() function in the broom package to save the results out to a dataframe:
9.3.3.4 Provisional conclusions from the first model
From Model 1 above, I am provisionally able to conclude that there is a moderate negative association between GSCE exam scores and the level of unauthorised absence in Wards across London. This relationship is significant at the 99% level of confidence that the relationship observed is not random chance. I can also say that around 43% of the variation in GCSE point scores can be attributed to variation in school absence.
However!
The conclusion above is currently only provisional as I have not checked that the model I have built meets a series of important assumptions. If not met, breaking any of these assumptions will throw the conclusions that I draw into doubt.
Therefore, these will be checked in turn.
9.3.4 Assumptions Underpinning Linear Regression
9.3.5 Assumption 1 - There is a linear relationship between the dependent and Independent variables
The best way to test for this assumption is to plot a scatter plot similar to the one created earlier. It may not always be practical to create a series of scatter plots, so one quick way to check that a linear relationship is probable is to look at the frequency distributions of the variables. If they are normally distributed, then there is a good chance that if the two variables are in some way correlated, this will be a linear relationship.
For example, look at the frequency distributions of our two variables earlier:
#let's check the distribution of these variables first
ggplot(LonWardProfiles, aes(x=`Average GCSE capped point scores - 2014`)) + geom_histogram(aes(y = ..density..),binwidth = 5) + geom_density(colour="red", size=1, adjust=1)

We would describe both of these distribution as being relatively ‘normally’ or gaussian disributed - https://en.wikipedia.org/wiki/Normal_distribution - and thus more likely to have a linear correlation (if they are indeed associated).
Contrast this with the median house price variable:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We would describe this as a positively ‘skewed’ - https://en.wikipedia.org/wiki/Skewness - distribution, i.e. there are more observations towards the lower end of the average house prices observed in the city, however there is a long tail to the distribution, i.e. there are a small number of wards where the average house price is very large indeed.
If we plot the raw house price variable against GCSE scores, we get the following scatter plot:

This indicates that we do not have a linear relationship, indeed it suggests that this might be a curvilinear relationship.
9.3.5.1 Transforming variables
One way that we might be able to achieve a linear relationship between our two variables is to transform the non-normally distributed variable so that it is more normally distributed.
There is some debate as to whether this is a wise thing to do as, amongst other things, the coefficients for transformed variables are much harder to interpret, however, we will have a go here to see if it makes a difference.
Tukey’s ladder of transformations
You might be asking how we could go about transforming our variables. In 1977, Tukey described a series of power transformations that could be applied to a variable to alter its frequency distribution.
In regression analysis, you analysts will frequently take the log of a variable to change its distribution, but this is a little crude and may not result in a completely normal distribution. For example, we can take the log of the house price variable:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This looks a little more like a normal distribution, but it is still a little skewed.
Fortunately in R, we can use the symbox() function in the car package to try a range of transfomations along Tukey’s ladder:

Observing the plot above, it appears that raising our house price variable to the power of -1 should lead to a more normal distribution:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Compare this with the logged transformation:

9.3.5.2 Should I transform my variables?
The decision is down to you as the modeller - it might be that a transformation doesn’t succeed in normalising the distribtuion of your data or that the interpretation after the transformation is problematic, however it is important not to violate the assumptions underpinning the regression model or your conclusions may be on shaky ground.
9.3.6 Assumption 2 - The residuals in your model should be normally distributed
This assumption is easy to check. When we ran our Model1 earlier, one of the outputs stored in our Model 1 object is the residual value for each case (Ward) in your dataset. We can plot these as a histogram and see if there is a normal distribution:
#save the residuals into your dataframe
LonWardProfiles$model1_resids <- model1$residuals
qplot(model1$residuals) + geom_histogram() ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Examining the histogram above, we can be happy that our residuals look to be relatively normally distributed.
9.3.7 Assumption 3 - No Multicolinearity in the independent variables
Now, the regression model we have be experimenting with so far is a simple bivariate (two variable) model. One of the nice things about regression modelling is while we can only easily visualise linear relationships in a two (or maximum 3) dimension scatter plot, mathematically, we can have as many dimensions / variables as we like.
As such, we could extend model 1 into a multiple regression model by adding some more explanatory variables that we think could affect GSCE scores. Let’s try the log or ^-1 transformed house price variable from earlier (the rationalle being that higher house prices indicate more affluence and therefore, potentially, more engagement with education):
model2 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`), data = LonWardProfiles)
#show the summary of those outputs
summary(model2)##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`), data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.257 -10.219 -0.468 8.828 50.821
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 201.517 20.093 10.029
## `Unauthorised Absence in All Schools (%) - 2013` -36.206 1.919 -18.869
## log(`Median House Price (£) - 2014`) 12.786 1.504 8.503
## Pr(>|t|)
## (Intercept) <2e-16 ***
## `Unauthorised Absence in All Schools (%) - 2013` <2e-16 ***
## log(`Median House Price (£) - 2014`) <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.53 on 623 degrees of freedom
## Multiple R-squared: 0.4833, Adjusted R-squared: 0.4816
## F-statistic: 291.3 on 2 and 623 DF, p-value: < 2.2e-16
#and for future use, write the residuals out to a column in your dataframe
LonWardProfiles$model2_resids <- residuals(model2)Examining the output above, it is clear that including median house price into our model improves the fit from an \(r^2\) of around 42% to an \(r^2\) of 48%. Median house price is also a statistically significant variable.
But do our two explanatory variables satisfy the no-multicoliniarity assumption? If not and the variables are highly correlated, then we are effectively double counting the influence of these variables and overstating their explanatory power.
To check this, we can compute the Pearson correlation coefficient between the variables. In an ideal world, we would be looking for something less than a 0.8 correlation
## corrplot 0.84 loaded
#first drop the geometry column from the dataframe as it will cause problems
tempdf <- st_set_geometry(LonWardProfiles,NULL)
#pull out the columns we want to check for multicolinearity
tempdf <- tempdf[,c("Median House Price (£) - 2014", "Unauthorised Absence in All Schools (%) - 2013")]
#log houseprice
tempdf[1] <- log(tempdf[1])
#rename the columns to something shorter
names(tempdf) <- c("Med House Price", "Un Auth Absence")
#compute the correlation matrix for the two variables of interest
cormat <- cor(tempdf[,1:2], use="complete.obs", method="pearson")
#visualise the correlation matrix
corrplot(cormat)
Looking at either the correlation matrix or the correlation plot of that matrix, it’s easy to see that there is a low correlation (around 30%) between our two independent variables.
9.3.7.1 Variance Inflation Factor (VIF)
Another way that we can check for Multicolinearity is to examine the VIF for the model. If we have VIF values for any variable exceeding 10, then we may need to worry and perhaps remove that variable from the analysis:
## `Unauthorised Absence in All Schools (%) - 2013`
## 1.105044
## log(`Median House Price (£) - 2014`)
## 1.105044
Both the correlation plots and examination of VIF indicate that our multiple regression model meets the assumptions around multicollinearity and so we can proceed further.
If we wanted to add more variables into our model, it would be useful to check for multicollinearity amongst every variable we want to include, we can do this by computing a correlation matrix for the whole dataset or checking the VIF after running the model:
tempdf <- st_set_geometry(LonWardProfiles,NULL)
cormat <- cor(tempdf[,10:74], use="complete.obs", method="pearson")
str(tempdf)## 'data.frame': 626 obs. of 75 variables:
## $ NAME : Factor w/ 604 levels "Abbey","Abbey Road",..: 102 531 33 9 36 130 101 508 394 490 ...
## $ GSS_CODE : chr "E05000405" "E05000414" "E05000401" "E05000400" ...
## $ HECTARES : num 755 259 145 269 188 ...
## $ NONLD_AREA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LB_GSS_CD : Factor w/ 33 levels "E09000001","E09000002",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ BOROUGH : Factor w/ 33 levels "Barking and Dagenham",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ POLY_ID : Factor w/ 625 levels "116694","116695",..: 461 181 321 327 182 180 354 328 326 322 ...
## $ Ward name : chr "Kingston upon Thames - Chessington South" "Kingston upon Thames - Tolworth and Hook Rise" "Kingston upon Thames - Berrylands" "Kingston upon Thames - Alexandra" ...
## $ Old code : chr "00AXGC" "00AXGM" "00AXFY" "00AXFX" ...
## $ Population - 2015 : num 10550 10650 9800 9700 10450 ...
## $ Children aged 0-15 - 2015 : num 2200 2150 1800 1900 2200 1950 1800 1900 1850 1350 ...
## $ Working-age (16-64) - 2015 : num 6800 7050 6500 6200 6900 7150 5800 7550 6100 8650 ...
## $ Older people aged 65+ - 2015 : num 1500 1450 1500 1600 1350 1600 1400 1450 1750 900 ...
## $ % All Children aged 0-15 - 2015 : num 20.9 20.1 18.2 19.7 20.8 18.3 20.1 17.4 19 12.2 ...
## $ % All Working-age (16-64) - 2015 : num 64.6 66.1 66.4 63.7 66 66.8 64.6 69.2 63 79.4 ...
## $ % All Older people aged 65+ - 2015 : num 14.4 13.8 15.4 16.6 13.1 14.9 15.3 13.5 18.1 8.3 ...
## $ Mean Age - 2013 : num 37.6 37.3 38.3 38.9 37.3 37.1 38.3 37.7 39.9 33.6 ...
## $ Median Age - 2013 : num 37 36 36 39 37 34 38 35 40 30 ...
## $ Area - Square Kilometres : num 7.6 2.6 1.5 2.7 1.9 4.4 1.9 1.7 1.8 1.4 ...
## $ Population density (persons per sq km) - 2013 : num 1375 3962 6467 3537 5447 ...
## $ % BAME - 2011 : num 13 27.2 18.1 29.8 32.6 35.5 15 19 30.8 21.3 ...
## $ % Not Born in UK - 2011 : num 14.3 26.1 24.1 27.2 32.4 40.6 17.2 25 27.3 29.1 ...
## $ % English is First Language of no one in household - 2011 : num 3.9 8 6.9 8.2 13.3 13.9 5 7 8.4 8.9 ...
## $ General Fertility Rate - 2013 : num 64.1 59.1 56.8 61.6 75 44.7 55 60 58.4 46.9 ...
## $ Male life expectancy -2009-13 : num 83.3 80.3 79 83.4 80.8 83.2 80.4 79.7 83.6 81.7 ...
## $ Female life expectancy -2009-13 : num 83.9 86.7 83.3 86.9 81.8 84.8 86.8 82.6 87.4 84.3 ...
## $ % children in reception year who are obese - 2011/12 to 2013/14 : num 8.1 4.9 4.6 7.6 6.5 4.9 9.8 4.2 6.7 2.9 ...
## $ % children in year 6 who are obese- 2011/12 to 2013/14 : num 24.6 15.2 8.6 20.6 13.1 21.4 23.8 9.8 16.4 8.7 ...
## $ Rate of All Ambulance Incidents per 1,000 population - 2014 : num 109.7 115.2 102.8 92.4 117.9 ...
## $ Rates of ambulance call outs for alcohol related illness - 2014 : num 0.5 0.6 0.4 0.4 0.6 0.7 0.7 0.4 0.3 0.7 ...
## $ Number Killed or Seriously Injured on the roads - 2014 : num 0 3 1 3 0 4 2 2 1 3 ...
## $ In employment (16-64) - 2011 : num 5291 4993 4813 4464 4807 ...
## $ Employment rate (16-64) - 2011 : num 78.5 76.2 74.8 75.1 71.6 64.5 76.4 78.2 74 69.4 ...
## $ Number of jobs in area - 2013 : num 5300 5200 2000 2500 4500 7200 2000 2100 1100 4300 ...
## $ Employment per head of resident WA population - 2013 : num 0.8 0.8 0.3 0.4 0.7 1 0.3 0.3 0.2 0.5 ...
## $ Rate of new registrations of migrant workers - 2011/12 : num 6.8 22.7 15.9 17.6 30 34.7 10.4 19 15.6 25.3 ...
## $ Median House Price (£) - 2014 : num 315000 337195 361125 404975 435000 ...
## $ Number of properties sold - 2014 : num 201 153 172 136 167 161 137 237 134 286 ...
## $ Median Household income estimate (2012/13) : num 38310 37840 42330 41390 40700 ...
## $ Number of Household spaces - 2011 : num 4112 3836 4447 3443 4047 ...
## $ % detached houses - 2011 : num 11.2 6.4 10.5 9.3 9.6 29.5 6.6 16.2 18.7 4.2 ...
## $ % semi-detached houses - 2011 : num 26.4 46.8 25.7 56.3 33.7 16.8 46.6 18.8 45.9 16.5 ...
## $ % terraced houses - 2011 : num 41.3 24.5 9 21.5 26.3 11.6 29.5 7.3 16.9 9.1 ...
## $ % Flat, maisonette or apartment - 2011 : num 20.9 22 54.7 12 30.4 42.1 17.2 57.7 18.5 70 ...
## $ % Households Owned - 2011 : num 77.3 71.2 54.6 80.2 63.7 61 74.8 60.5 77.4 53.1 ...
## $ % Households Social Rented - 2011 : num 11.7 10.7 17.4 4.9 11.5 11.6 14.8 7.5 9.1 6.7 ...
## $ % Households Private Rented - 2011 : num 9.7 16.5 26.9 13.3 22.1 25.5 8.6 30.7 12.3 38.6 ...
## $ % dwellings in council tax bands A or B - 2015 : num 4.1 4.6 8.9 4 5.5 1.1 1.8 4.3 3.1 8.4 ...
## $ % dwellings in council tax bands C, D or E - 2015 : num 89.2 91.8 70.4 79.4 76.4 51.7 96.2 74.7 63.6 80.2 ...
## $ % dwellings in council tax bands F, G or H - 2015 : num 6.7 3.6 20.7 16.6 18.1 47.2 2 21 33.3 11.4 ...
## $ Claimant rate of key out-of-work benefits (working age client group) (2014): num 7.3 6.7 7.3 5.5 7.2 4.9 8.4 5.2 6.7 4.8 ...
## $ Claimant Rate of Housing Benefit (2015) : num 6.4 6.8 9.8 5.1 8.8 5.4 6.3 5.4 5.4 5.5 ...
## $ Claimant Rate of Employment Support Allowance - 2014 : num 3.2 3 4.2 3.1 3.6 2 3.7 2.6 2.6 2.8 ...
## $ Rate of JobSeekers Allowance (JSA) Claimants - 2015 : num 1.3 1.3 1.4 1 1.8 1.3 1.4 1.1 1.1 1.5 ...
## $ % dependent children (0-18) in out-of-work households - 2014 : num 13.1 10.2 6.8 6.3 9.2 8.2 10.9 9.3 8.9 4.3 ...
## $ % of households with no adults in employment with dependent children - 2011: num 3.9 3.3 2.3 2.7 4.1 3.9 3.6 2.3 3.7 1.6 ...
## $ % of lone parents not in employment - 2011 : num 39.6 38.2 36.9 31.2 43.4 50.2 35.9 39.3 39.2 40.8 ...
## $ (ID2010) - Rank of average score (within London) - 2010 : num 548 528 530 590 544 570 516 568 566 537 ...
## $ (ID2010) % of LSOAs in worst 50% nationally - 2010 : num 16.7 0 16.7 0 16.7 14.3 0 0 16.7 16.7 ...
## $ Average GCSE capped point scores - 2014 : num 321 338 343 353 372 ...
## $ Unauthorised Absence in All Schools (%) - 2013 : num 0.8 0.7 0.5 0.4 0.7 0.9 0.8 0.6 0.7 0.5 ...
## $ % with no qualifications - 2011 : num 19.1 18.4 11.9 15.8 15.2 10.7 21.9 10.7 15.7 6.7 ...
## $ % with Level 4 qualifications and above - 2011 : num 25.3 30 48.4 32.7 41.7 44.5 22.8 52.2 34.4 48.3 ...
## $ A-Level Average Point Score Per Student - 2013/14 : num 643 714 734 762 762 ...
## $ A-Level Average Point Score Per Entry; 2013/14 : num 204 213 213 215 219 ...
## $ Crime rate - 2014/15 : num 47.5 53.7 31.6 34.7 64.5 47.8 43.9 29.6 29.6 49.7 ...
## $ Violence against the person rate - 2014/15 : num 16.3 14.2 8.2 7.9 15.4 12.6 13.9 10.1 8 13.5 ...
## $ Deliberate Fires per 1,000 population - 2014 : num 0.6 0.7 0.3 0.3 0.7 0.3 1.2 0.3 0.1 0.4 ...
## $ % area that is open space - 2014 : num 75.7 33.1 10.9 43.7 26.4 42.1 27.5 5.8 21.3 23.9 ...
## $ Cars per household - 2011 : num 1.4 1.2 1 1.4 1 1.3 1.3 1 1.4 0.9 ...
## $ Average Public Transport Accessibility score - 2014 : num 2.4 2.3 2.8 2.2 2.8 2.5 2.4 3.3 2.7 3.7 ...
## $ % travel by bicycle to work - 2011 : num 2.3 2.9 4.6 3.9 4.4 3.2 2.5 3.7 2.3 4.2 ...
## $ Turnout at Mayoral election - 2012 : num 30.4 32.1 38 36.3 41.2 34.9 30.8 34.7 36.1 30.2 ...
## $ model1_resids : num -17.18 -5.11 -8.15 -1.68 29.69 ...
## $ model2_resids : num -13.12 -1.41 -4.33 1.18 30.13 ...

9.3.8 Assumption 4 - Homoscedasticity
Homoscedasticity means that the errors/residuals in the model exhibit constant / homogenous variance, if they don’t, then we would say that there is hetroscedasticity present. Why does this matter? Andy Field does a much better job of explaining this than I could ever do here - https://www.discoveringstatistics.com/tag/homoscedasticity/ - but essentially, if your errors do not have constant variance, then your parameter estimates could be wrong, as could the estimates of their significance.
The best way to check for homo/hetroscedasticity is to plot the residuals in the model against the predicted values. We are looking for a cloud of points with no apparent patterning to them.




In the series of plots above, the first plot (residuals vs fitted), we would hope to find a random cloud of points with a straight horizontal red line. Looking at the plot, the curved red line would suggest some hetroscedasticity, but the cloud looks quite random. Similarly we are looking for a random cloud of points with no apparent patterning or shape in the third plot of standardised residuals vs fitted values. Here, the cloud of points also looks fairly random, with perhaps some shaping indicated by the red line.
9.3.9 Assumption 5 - Independence of Errors
This assumption simply states that the residual values (errors) in your model must not be correlated in any way. If they are, then they exhibit autocorrelation which suggests that something might be going on in the background that we have not sufficiently accounted for in our model.
9.3.9.1 Standard Autocorrelation
If you are running a regression model on data that do not have explicit space or time dimensions, then the standard test for autocorrelation would be the Durbin-Watson test.
This tests whether residuals are correlated and produces a summary statistic that ranges between 0 and 4, with 2 signifiying no autocorrelation. A value greater than 2 suggesting negative autocorrelation and and value of less than 2 indicating postitve autocorrelation.
In his excellent text book - https://www.discoveringstatistics.com/books/discovering-statistics-using-r/ - Andy Field suggests that you should be concerned with Durbin-Watson test statistics <1 or >3. So let’s see:
## lag Autocorrelation D-W Statistic p-value
## 1 0.1930199 1.61264 0
## Alternative hypothesis: rho != 0
As you can see, the DW statistics for our model is 1.61, so some indication of autocorrelation, but perhaps nothing to worry about.
HOWEVER
We are using spatially referenced data and as such we should check for spatial-autocorrelation.
The first test we should carry out is to map the residuals to see if there are any apparent obvious patterns:
## tmap mode set to interactive viewing
#qtm(LonWardProfiles, fill = "model1_resids")
tm_shape(LonWardProfiles) +
tm_polygons("model2_resids",
palette = "RdYlBu") +
tm_shape(lond_sec_schools_sf) + tm_dots(col = "TYPE")## Variable "model2_resids" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Looking at the map above, there look to be some blue areas next to other blue areas and some red/orange areas next to other red/orange areas. This suggests that there could well be some spatial autocorrelation biasing our model, but can we test for spatial autocorrelation more systematically?
Yes - and some of you will remember this from the practical two weeks ago. We can calculate a number of different statistics to check for spatial autocorrelation - the most common of these being Moran’s I.
Let’s do that now.
#####
#Firstly convert our SF object into an SP object:
LonWardProfilesSP <- as(LonWardProfiles,"Spatial")
names(LonWardProfilesSP)## [1] "NAME"
## [2] "GSS_CODE"
## [3] "HECTARES"
## [4] "NONLD_AREA"
## [5] "LB_GSS_CD"
## [6] "BOROUGH"
## [7] "POLY_ID"
## [8] "Ward.name"
## [9] "Old.code"
## [10] "Population...2015"
## [11] "Children.aged.0.15...2015"
## [12] "Working.age..16.64....2015"
## [13] "Older.people.aged.65....2015"
## [14] "X..All.Children.aged.0.15...2015"
## [15] "X..All.Working.age..16.64....2015"
## [16] "X..All.Older.people.aged.65....2015"
## [17] "Mean.Age...2013"
## [18] "Median.Age...2013"
## [19] "Area...Square.Kilometres"
## [20] "Population.density..persons.per.sq.km....2013"
## [21] "X..BAME...2011"
## [22] "X..Not.Born.in.UK...2011"
## [23] "X..English.is.First.Language.of.no.one.in.household...2011"
## [24] "General.Fertility.Rate...2013"
## [25] "Male.life.expectancy..2009.13"
## [26] "Female.life.expectancy..2009.13"
## [27] "X..children.in.reception.year.who.are.obese...2011.12.to.2013.14"
## [28] "X..children.in.year.6.who.are.obese..2011.12.to.2013.14"
## [29] "Rate.of.All.Ambulance.Incidents.per.1.000.population...2014"
## [30] "Rates.of.ambulance.call.outs.for.alcohol.related.illness...2014"
## [31] "Number.Killed.or.Seriously.Injured.on.the.roads...2014"
## [32] "In.employment..16.64....2011"
## [33] "Employment.rate..16.64....2011"
## [34] "Number.of.jobs.in.area...2013"
## [35] "Employment.per.head.of.resident.WA.population...2013"
## [36] "Rate.of.new.registrations.of.migrant.workers...2011.12"
## [37] "Median.House.Price.......2014"
## [38] "Number.of.properties.sold...2014"
## [39] "Median.Household.income.estimate..2012.13."
## [40] "Number.of.Household.spaces...2011"
## [41] "X..detached.houses...2011"
## [42] "X..semi.detached.houses...2011"
## [43] "X..terraced.houses...2011"
## [44] "X..Flat..maisonette.or.apartment...2011"
## [45] "X..Households.Owned...2011"
## [46] "X..Households.Social.Rented...2011"
## [47] "X..Households.Private.Rented...2011"
## [48] "X..dwellings.in.council.tax.bands.A.or.B...2015"
## [49] "X..dwellings.in.council.tax.bands.C..D.or.E...2015"
## [50] "X..dwellings.in.council.tax.bands.F..G.or.H...2015"
## [51] "Claimant.rate.of.key.out.of.work.benefits..working.age.client.group...2014."
## [52] "Claimant.Rate.of.Housing.Benefit..2015."
## [53] "Claimant.Rate.of.Employment.Support.Allowance...2014"
## [54] "Rate.of.JobSeekers.Allowance..JSA..Claimants...2015"
## [55] "X..dependent.children..0.18..in.out.of.work.households...2014"
## [56] "X..of.households.with.no.adults.in.employment.with.dependent.children...2011"
## [57] "X..of.lone.parents.not.in.employment...2011"
## [58] "X.ID2010....Rank.of.average.score..within.London....2010"
## [59] "X.ID2010....of.LSOAs.in.worst.50..nationally...2010"
## [60] "Average.GCSE.capped.point.scores...2014"
## [61] "Unauthorised.Absence.in.All.Schools.......2013"
## [62] "X..with.no.qualifications...2011"
## [63] "X..with.Level.4.qualifications.and.above...2011"
## [64] "A.Level.Average.Point.Score.Per.Student...2013.14"
## [65] "A.Level.Average.Point.Score.Per.Entry..2013.14"
## [66] "Crime.rate...2014.15"
## [67] "Violence.against.the.person.rate...2014.15"
## [68] "Deliberate.Fires.per.1.000.population...2014"
## [69] "X..area.that.is.open.space...2014"
## [70] "Cars.per.household...2011"
## [71] "Average.Public.Transport.Accessibility.score...2014"
## [72] "X..travel.by.bicycle.to.work...2011"
## [73] "Turnout.at.Mayoral.election...2012"
## [74] "model1_resids"
## [75] "model2_resids"
#and calculate the centroids of all Wards in London
coordsW <- coordinates(LonWardProfilesSP)
plot(coordsW)
#Now we need to generate a spatial weights matrix (remember from the lecture a couple of weeks ago). We'll start with a simple binary matrix of queen's case neighbours
#create a neighbours list of queens contiguity
LWard_nb <- poly2nb(LonWardProfilesSP, queen=T)
#or nearest neighbours
knn_wards <- knearneigh(coordsW, k=4)## Warning in knearneigh(coordsW, k = 4): knearneigh: identical points found



#create a spatial weights matrix object from these weights
Lward.queens_weight <- nb2listw(LWard_nb, style="C")
Lward.knn_4_weight <- nb2listw(LWard_knn, style="C")
#now run a moran's I test on the residuals
#first using queens neighbours
moran.test(LonWardProfilesSP@data$model2_resids, Lward.queens_weight)##
## Moran I test under randomisation
##
## data: LonWardProfilesSP@data$model2_resids
## weights: Lward.queens_weight
##
## Moran I statistic standard deviate = 11.668, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.2678354337 -0.0016000000 0.0005332379
##
## Moran I test under randomisation
##
## data: LonWardProfilesSP@data$model2_resids
## weights: Lward.knn_4_weight
##
## Moran I statistic standard deviate = 10.938, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.2915804046 -0.0016000000 0.0007183984
Observing the Moran’s I statistic for both Queen’s case neighbours and k-nearest neighbours of 4, we can see that the Moran’s I statistic is somewhere between 0.27 and 0.29. Remembering that Moran’s I ranges from between -1 and +1 (0 indicating no spatial autocorrelation) we can conclude that there is some weak to moderate spatial autocorrelation in our residuals.
This means that despite passing most of the assumptions of linear regression, we could have a situation here where the presence of some spatial autocorrelation could be leading to biased estimates of our parameters and significance values.
9.4 Spatial Regression Models
9.4.2
We will now read in some extra data which we will use shortly
## Parsed with column specification:
## cols(
## WardName = col_character(),
## WardCode = col_character(),
## Wardcode = col_character(),
## PctSharedOwnership2011 = col_double(),
## PctRentFree2011 = col_double(),
## Candidate = col_character(),
## InnerOuter = col_character(),
## x = col_double(),
## y = col_double(),
## AvgGCSE2011 = col_double(),
## UnauthAbsenceSchools11 = col_double()
## )
#add the extra data too
LonWardProfiles <- left_join(LonWardProfiles, extradata, by = c("GSS_CODE" = "Wardcode"))
names(LonWardProfiles)## [1] "NAME"
## [2] "GSS_CODE"
## [3] "HECTARES"
## [4] "NONLD_AREA"
## [5] "LB_GSS_CD"
## [6] "BOROUGH"
## [7] "POLY_ID"
## [8] "Ward name"
## [9] "Old code"
## [10] "Population - 2015"
## [11] "Children aged 0-15 - 2015"
## [12] "Working-age (16-64) - 2015"
## [13] "Older people aged 65+ - 2015"
## [14] "% All Children aged 0-15 - 2015"
## [15] "% All Working-age (16-64) - 2015"
## [16] "% All Older people aged 65+ - 2015"
## [17] "Mean Age - 2013"
## [18] "Median Age - 2013"
## [19] "Area - Square Kilometres"
## [20] "Population density (persons per sq km) - 2013"
## [21] "% BAME - 2011"
## [22] "% Not Born in UK - 2011"
## [23] "% English is First Language of no one in household - 2011"
## [24] "General Fertility Rate - 2013"
## [25] "Male life expectancy -2009-13"
## [26] "Female life expectancy -2009-13"
## [27] "% children in reception year who are obese - 2011/12 to 2013/14"
## [28] "% children in year 6 who are obese- 2011/12 to 2013/14"
## [29] "Rate of All Ambulance Incidents per 1,000 population - 2014"
## [30] "Rates of ambulance call outs for alcohol related illness - 2014"
## [31] "Number Killed or Seriously Injured on the roads - 2014"
## [32] "In employment (16-64) - 2011"
## [33] "Employment rate (16-64) - 2011"
## [34] "Number of jobs in area - 2013"
## [35] "Employment per head of resident WA population - 2013"
## [36] "Rate of new registrations of migrant workers - 2011/12"
## [37] "Median House Price (£) - 2014"
## [38] "Number of properties sold - 2014"
## [39] "Median Household income estimate (2012/13)"
## [40] "Number of Household spaces - 2011"
## [41] "% detached houses - 2011"
## [42] "% semi-detached houses - 2011"
## [43] "% terraced houses - 2011"
## [44] "% Flat, maisonette or apartment - 2011"
## [45] "% Households Owned - 2011"
## [46] "% Households Social Rented - 2011"
## [47] "% Households Private Rented - 2011"
## [48] "% dwellings in council tax bands A or B - 2015"
## [49] "% dwellings in council tax bands C, D or E - 2015"
## [50] "% dwellings in council tax bands F, G or H - 2015"
## [51] "Claimant rate of key out-of-work benefits (working age client group) (2014)"
## [52] "Claimant Rate of Housing Benefit (2015)"
## [53] "Claimant Rate of Employment Support Allowance - 2014"
## [54] "Rate of JobSeekers Allowance (JSA) Claimants - 2015"
## [55] "% dependent children (0-18) in out-of-work households - 2014"
## [56] "% of households with no adults in employment with dependent children - 2011"
## [57] "% of lone parents not in employment - 2011"
## [58] "(ID2010) - Rank of average score (within London) - 2010"
## [59] "(ID2010) % of LSOAs in worst 50% nationally - 2010"
## [60] "Average GCSE capped point scores - 2014"
## [61] "Unauthorised Absence in All Schools (%) - 2013"
## [62] "% with no qualifications - 2011"
## [63] "% with Level 4 qualifications and above - 2011"
## [64] "A-Level Average Point Score Per Student - 2013/14"
## [65] "A-Level Average Point Score Per Entry; 2013/14"
## [66] "Crime rate - 2014/15"
## [67] "Violence against the person rate - 2014/15"
## [68] "Deliberate Fires per 1,000 population - 2014"
## [69] "% area that is open space - 2014"
## [70] "Cars per household - 2011"
## [71] "Average Public Transport Accessibility score - 2014"
## [72] "% travel by bicycle to work - 2011"
## [73] "Turnout at Mayoral election - 2012"
## [74] "model1_resids"
## [75] "model2_resids"
## [76] "WardName"
## [77] "WardCode"
## [78] "PctSharedOwnership2011"
## [79] "PctRentFree2011"
## [80] "Candidate"
## [81] "InnerOuter"
## [82] "x"
## [83] "y"
## [84] "AvgGCSE2011"
## [85] "UnauthAbsenceSchools11"
## [86] "geometry"
9.4.3 Extending your regression model - Dummy Variables
What if instead of fitting one line to our cloud of points, we could fit several depending on whether the Wards we were analysing fell into some or other group. What if the relationship between attending school and achieving good exam results varied between inner and outer London, for example. Could we test for that? Well yes we can - quite easily in fact.
If we colour the points representing Wards for Inner and Outer London differently, we can start to see that there might be something interesting going on. Using 2011 data (as there are not the rounding errors that there are in the more recent data), there seems to be a stronger relationship between absence and GCSE scores in Outer London than Inner London. We can test for this in a standard linear regression model.
p <- ggplot(LonWardProfiles, aes(x=`UnauthAbsenceSchools11`, y=`AvgGCSE2011`))
p + geom_point(aes(colour = InnerOuter)) 
Dummy variables are always categorical data (inner or outer London, or red / blue etc.). When we incorporate them into a regression model, they serve the purpose of splitting our analysis into groups. In the graph above, it would mean, effectively, having a separate regression line for the red points and a separate line for the blue points.
Let’s try it!
#first, let's make sure R is reading our InnerOuter variable as a factor
LonWardProfiles$InnerOuter <- as.factor(LonWardProfiles$InnerOuter)
#now run the model
model3 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`) + `InnerOuter`, data = LonWardProfiles)
summary(model3)##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter, data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.126 -9.874 -0.156 8.643 49.770
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 97.587 24.058 4.056
## `Unauthorised Absence in All Schools (%) - 2013` -30.097 2.028 -14.842
## log(`Median House Price (£) - 2014`) 19.834 1.742 11.384
## InnerOuterOuter 10.932 1.509 7.243
## Pr(>|t|)
## (Intercept) 5.62e-05 ***
## `Unauthorised Absence in All Schools (%) - 2013` < 2e-16 ***
## log(`Median House Price (£) - 2014`) < 2e-16 ***
## InnerOuterOuter 1.30e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.93 on 622 degrees of freedom
## Multiple R-squared: 0.5235, Adjusted R-squared: 0.5212
## F-statistic: 227.8 on 3 and 622 DF, p-value: < 2.2e-16
So how can we interpret this?
Well, the dummy variable is statistically significant and the coefficient tells us the difference between the two groups (Inner and Outer London) we are comparing. In this case, it is telling us that living in a Ward in outer London will improve your average GCSE score by 10.93 points, on average, compared to if you lived in Inner London. The R-squared has increased slightly, but not by much.
You will notice that despite there being two values in our dummy variable (Inner and Outer), we only get one coefficient. This is because with dummy variables, one value is always considered to be the control (comparison/reference) group. In this case we are comparing Outer London to Inner London. If our dummy variable had more than 2 levels we would have more coefficients, but always one as the reference.
The order in which the dummy comparisons are made is determined by what is known as a ‘contrast matrix’. This determines the treatment group (1) and the control (reference) group (0). We can view the contrast matrix using the contrasts() function:
## Outer
## Inner 0
## Outer 1
If we want to change the reference group, there are various ways of doing this. We can use the contrasts() function, or we can use the relevel() function. Let’s try it:
LonWardProfiles$InnerOuter <- relevel(LonWardProfiles$InnerOuter, ref="Outer")
model3 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`) + `InnerOuter`, data = LonWardProfiles)
summary(model3)##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter, data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.126 -9.874 -0.156 8.643 49.770
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 108.518 23.189 4.680
## `Unauthorised Absence in All Schools (%) - 2013` -30.097 2.028 -14.842
## log(`Median House Price (£) - 2014`) 19.834 1.742 11.384
## InnerOuterInner -10.932 1.509 -7.243
## Pr(>|t|)
## (Intercept) 3.53e-06 ***
## `Unauthorised Absence in All Schools (%) - 2013` < 2e-16 ***
## log(`Median House Price (£) - 2014`) < 2e-16 ***
## InnerOuterInner 1.30e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.93 on 622 degrees of freedom
## Multiple R-squared: 0.5235, Adjusted R-squared: 0.5212
## F-statistic: 227.8 on 3 and 622 DF, p-value: < 2.2e-16
You will notice that the only thing that has changed in the model is that the coefficient for the InnerOuter variable now relates to Inner London and is now negative (meaning that living in Inner London is likely to reduce your average GCSE score by 10.93 points compared to Outer London). The rest of the model is exactly the same.
9.4.4 TASK: Investigating Further - Adding More Explanatory Variables into a multiple regression model
Now it’s your turn. You have been shown how you could begin to model average GCSE scores across London, but the models we have built so far have been fairly simple in terms of explanatory variables.
You should try and build the optimum model of GCSE performance from your data in your LondonWards dataset. Experiment with adding different variables - when building a regression model in this way, you are trying to hit a sweet spot between increasing your R-squared value as much as possible, but with as few explanatory variables as possible.
9.4.4.1 A few things to watch out for…
You should never just throw variables at a model without a good theoretical reason for why they might have an influence. Choose your variables carefully!
Be prepared to take variables out of your model either if a new variable confounds (becomes more important than) earlier variables or turns out not to be significant.
For example, let’s try adding the rate of drugs related crime (logged as it is a positively skewed variable, where as the log is normal) and the number of cars per household… are these variables significant? What happens to the spatial errors in your models?
9.5 Task 3 - Spatial Non-stationarity and Geographically Weighted Regression Models (GWR)
Here’s my final model from the last section:
#select some variables from the data file
myvars <- c("Average GCSE capped point scores - 2014","Unauthorised Absence in All Schools (%) - 2013","Median House Price (£) - 2014","Rate of JobSeekers Allowance (JSA) Claimants - 2015", "% with Level 4 qualifications and above - 2011")
#check their correlations are OK
cormat <- cor(tempdf[myvars], use="complete.obs", method="pearson")
#run a final OLS model
model_final <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`) + `InnerOuter` + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` + `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles)
summary(model_final)##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` +
## `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.516 -9.083 0.164 9.163 45.572
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 239.93832 27.93079
## `Unauthorised Absence in All Schools (%) - 2013` -23.61673 2.16122
## log(`Median House Price (£) - 2014`) 8.41359 2.26238
## InnerOuterInner -10.36905 1.64637
## `Rate of JobSeekers Allowance (JSA) Claimants - 2015` -2.81347 0.63516
## `% with Level 4 qualifications and above - 2011` 0.41275 0.07836
## t value Pr(>|t|)
## (Intercept) 8.590 < 2e-16 ***
## `Unauthorised Absence in All Schools (%) - 2013` -10.927 < 2e-16 ***
## log(`Median House Price (£) - 2014`) 3.719 0.000218 ***
## InnerOuterInner -6.298 5.71e-10 ***
## `Rate of JobSeekers Allowance (JSA) Claimants - 2015` -4.430 1.12e-05 ***
## `% with Level 4 qualifications and above - 2011` 5.268 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.24 on 620 degrees of freedom
## Multiple R-squared: 0.568, Adjusted R-squared: 0.5645
## F-statistic: 163 on 5 and 620 DF, p-value: < 2.2e-16




## Variable "model_final_res" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
LonWardProfilesSP <- as(LonWardProfiles,"Spatial")
moran.test(LonWardProfilesSP@data$model_final_res, Lward.knn_4_weight)##
## Moran I test under randomisation
##
## data: LonWardProfilesSP@data$model_final_res
## weights: Lward.knn_4_weight
##
## Moran I statistic standard deviate = 8.4186, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.2240092085 -0.0016000000 0.0007181894
Now, we probably could stop at running a spatial error model at this point, but it could be that rather than spatial autocorrelation causing problems with our model, it might be that a “global” regression model does not capture the full story. In some parts of our study area, the relationships between the dependent and independent variable may not exhibit the same slope coefficient. While, for example, increases in unauthorised absence usually are negatively correlated with GCSE score (students missing school results in lower exam scores), in some parts of the city, they could be positively correlated (in affluent parts of the city, rich parents may enrol their children for just part of the year and then live elsewhere in the world for another part of the year, leading to inflated unauthorised absence figures. Ski holidays are cheaper during the school term, but the pupils will still have all of the other advantages of living in a well off household that will benefit their exam scores.
If this occurs, then we have ‘non-stationarity’ - this is when the global model does not represent the relationships between variables that might vary locally.
This part of the practical will only skirt the edges of GWR, for much more detail you should visit the GWR website which is produced and maintained by Prof Chris Brunsdon and Dr Martin Charlton who originally developed the technique - http://gwr.nuim.ie/
There are various packages which will carry out GWR in R, for this pracical we we use spgwr (mainly because it was the first one I came across), although you could also use GWmodel or gwrr. At the end of this practical, you can test out these ideas in ArcGIS using the GWR toolset in the Spatial Statistics Toolbox - http://resources.arcgis.com/en/help/main/10.1/index.html#/Geographically_Weighted_Regression_GWR/005p00000021000000/
I should also acknowledge the guide on GWR (accessible here: http://www.bris.ac.uk/cmpo/events/2009/segregation/gwr.pdf) produced by the University of Bristol, which was a great help when producing this exercise.
## Warning: package 'spgwr' was built under R version 3.6.2
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
#calculate kernel bandwidth
GWRbandwidth <- gwr.sel(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`) + `InnerOuter` + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` + `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles, coords=coordsW,adapt=T)## Adaptive q: 0.381966 CV score: 124832.2
## Adaptive q: 0.618034 CV score: 126396.1
## Adaptive q: 0.236068 CV score: 122752.4
## Adaptive q: 0.145898 CV score: 119960.5
## Adaptive q: 0.09016994 CV score: 116484.6
## Adaptive q: 0.05572809 CV score: 112628.7
## Adaptive q: 0.03444185 CV score: 109427.7
## Adaptive q: 0.02128624 CV score: 107562.9
## Adaptive q: 0.01315562 CV score: 108373.2
## Adaptive q: 0.02161461 CV score: 107576.6
## Adaptive q: 0.0202037 CV score: 107505.1
## Adaptive q: 0.01751157 CV score: 107333
## Adaptive q: 0.01584775 CV score: 107175.5
## Adaptive q: 0.01481944 CV score: 107564.8
## Adaptive q: 0.01648327 CV score: 107187.9
## Adaptive q: 0.01603246 CV score: 107143.9
## Adaptive q: 0.01614248 CV score: 107153.1
## Adaptive q: 0.01607315 CV score: 107147.2
## Adaptive q: 0.01596191 CV score: 107143
## Adaptive q: 0.01592122 CV score: 107154.4
## Adaptive q: 0.01596191 CV score: 107143
#run the gwr model
gwr.model = gwr(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` + log(`Median House Price (£) - 2014`) + `InnerOuter` + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` + `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles, coords=coordsW, adapt=GWRbandwidth, hatmatrix=TRUE, se.fit=TRUE)
#print the results of the model
gwr.model## Call:
## gwr(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013` +
## log(`Median House Price (£) - 2014`) + InnerOuter + `Rate of JobSeekers Allowance (JSA) Claimants - 2015` +
## `% with Level 4 qualifications and above - 2011`, data = LonWardProfiles,
## coords = coordsW, adapt = GWRbandwidth, hatmatrix = TRUE,
## se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.01596191 (about 9 of 626 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu.
## X.Intercept. -345.016491 105.265894
## X.Unauthorised.Absence.in.All.Schools.......2013. -47.061203 -30.804616
## log..Median.House.Price.......2014.. -23.968275 -0.027145
## InnerOuterInner -25.997740 -10.133154
## X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015. -18.958945 -5.511328
## X...with.Level.4.qualifications.and.above...2011. -1.175642 0.055792
## Median 3rd Qu.
## X.Intercept. 206.279057 337.189642
## X.Unauthorised.Absence.in.All.Schools.......2013. -22.971543 -14.344216
## log..Median.House.Price.......2014.. 10.924285 18.809942
## InnerOuterInner -5.871914 -1.756184
## X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015. -3.278698 -1.114797
## X...with.Level.4.qualifications.and.above...2011. 0.509708 0.832212
## Max. Global
## X.Intercept. 699.850993 239.9383
## X.Unauthorised.Absence.in.All.Schools.......2013. 6.798700 -23.6167
## log..Median.House.Price.......2014.. 44.788735 8.4136
## InnerOuterInner 11.694249 -10.3690
## X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015. 20.033764 -2.8135
## X...with.Level.4.qualifications.and.above...2011. 3.042313 0.4127
## Number of data points: 626
## Effective number of parameters (residual: 2traceS - traceS'S): 160.9269
## Effective degrees of freedom (residual: 2traceS - traceS'S): 465.0731
## Sigma (residual: 2traceS - traceS'S): 12.35905
## Effective number of parameters (model: traceS): 116.0071
## Effective degrees of freedom (model: traceS): 509.9929
## Sigma (model: traceS): 11.80222
## Sigma (ML): 10.65267
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 5026.882
## AIC (GWR p. 96, eq. 4.22): 4854.513
## Residual sum of squares: 71038.1
## Quasi-global R2: 0.7557128
The output from the GWR model reveals how the coefficients vary across the 625 Wards in our London Study region. You will see how the global coefficients are exactly the same as the coefficients in the earlier lm model. In this particular model (yours will look a little different if you have used different explanatory variables), if we take unauthorised absence from school, we can see that the coefficients range from a minimum value of -47.06 (1 unit change in unauthorised absence resulting in a drop in average GCSE score of -47.06) to +6.8 (1 unit change in unauthorised absence resulting in an increase in average GCSE score of +6.8). For half of the wards in the dataset, as unauthorised absence rises by 1 point, GCSE scores will decrease between -30.80 and -14.34 points (the inter-quartile range between the 1st Qu and the 3rd Qu).
You will notice that the R-Squared value has improved - this is not uncommon for GWR models, but it doesn’t necessarily mean they are definitely better than global models. The small number of cases under the kernel means that GW models have been criticised for lacking statistical robustness.
Coefficient ranges can also be seen for the other variables and they suggest some interesting spatial patterning. To explore this we can plot the GWR coefficients for different variables. Firstly we can attach the coefficients to our original dataframe - this can be achieved simply as the coefficients for each ward appear in the same order in our spatial points dataframe as they do in the original dataframe.
## [1] "sum.w"
## [2] "X.Intercept."
## [3] "X.Unauthorised.Absence.in.All.Schools.......2013."
## [4] "log..Median.House.Price.......2014.."
## [5] "InnerOuterInner"
## [6] "X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015."
## [7] "X...with.Level.4.qualifications.and.above...2011."
## [8] "X.Intercept._se"
## [9] "X.Unauthorised.Absence.in.All.Schools.......2013._se"
## [10] "log..Median.House.Price.......2014.._se"
## [11] "InnerOuterInner_se"
## [12] "X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015._se"
## [13] "X...with.Level.4.qualifications.and.above...2011._se"
## [14] "gwr.e"
## [15] "pred"
## [16] "pred.se"
## [17] "localR2"
## [18] "X.Intercept._se_EDF"
## [19] "X.Unauthorised.Absence.in.All.Schools.......2013._se_EDF"
## [20] "log..Median.House.Price.......2014.._se_EDF"
## [21] "InnerOuterInner_se_EDF"
## [22] "X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015._se_EDF"
## [23] "X...with.Level.4.qualifications.and.above...2011._se_EDF"
## [24] "pred.se.1"
## [25] "coord.x"
## [26] "coord.y"
#attach coefficients to original dataframe
LonWardProfilesSP@data$coefUnauthAbs<-results$X.Unauthorised.Absence.in.All.Schools.......2013.
LonWardProfilesSP@data$coefHousePrice<-results$log..Median.House.Price.......2014..
LonWardProfilesSP@data$coefJSA<-results$X.Rate.of.JobSeekers.Allowance..JSA..Claimants...2015.
LonWardProfilesSP@data$coefLev4Qual<-results$X...with.Level.4.qualifications.and.above...2011.## Variable "coefUnauthAbs" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable "coefHousePrice" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.